global attention
Tailoring Self-Attention for Graph via Rooted Subtrees
Attention mechanisms have made significant strides in graph learning, yet they still exhibit notable limitations: local attention faces challenges in capturing long-range information due to the inherent problems of the message-passing scheme, while global attention cannot reflect the hierarchical neighborhood structure and fails to capture fine-grained local information. In this paper, we propose a novel multihop graph attention mechanism, named Subtree Attention (STA), to address the aforementioned issues. STA seamlessly bridges the fully-attentional structure and the rooted subtree, with theoretical proof that STA approximates the global attention under extreme settings.
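The contrast between local and global attention described above can be illustrated with a small sketch. This is not the STA mechanism itself: it is plain masked softmax attention on a toy path graph, with identity query/key projections. The names `attention_weights` and `k_hop_neighbors`, the graph, and the features are all hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, allowed):
    """Row-wise softmax attention; allowed[i] is the set of nodes i may attend to."""
    weights = []
    for i, q in enumerate(queries):
        idx = sorted(allowed[i])
        scores = [sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        row = [0.0] * len(keys)
        for j, p in zip(idx, softmax(scores)):
            row[j] = p
        weights.append(row)
    return weights

def k_hop_neighbors(adj, k):
    """Nodes reachable within k hops of each node (the node itself included)."""
    result = []
    for start in range(len(adj)):
        seen = {start}
        frontier = {start}
        for _ in range(k):
            frontier = {v for u in frontier for v in adj[u]} - seen
            seen |= frontier
        result.append(seen)
    return result

# Toy path graph 0 - 1 - 2 - 3 - 4, given as adjacency lists (assumed data).
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.5, -0.5], [1.0, 1.0]]

# Local attention: each node sees only itself and its 1-hop neighbors.
local_w = attention_weights(feats, feats, k_hop_neighbors(adj, 1))
# Global attention: each node sees every node, regardless of hop distance.
global_w = attention_weights(feats, feats, [set(range(len(adj)))] * len(adj))
```

In this sketch node 0 gives zero weight to node 4 under local attention, since the two are four hops apart, while global attention assigns it a nonzero weight that carries no information about that distance. Raising `k` in `k_hop_neighbors` widens the receptive field one hop at a time, which is the kind of hierarchical neighborhood structure the abstract refers to.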
cf78a15772ec1a6aee9bbee2d2b382c3-Supplemental-Conference.pdf
Our first step is to prove the parameterization (Eq. 3) provides local attention after the ... Note that the weight and bias terms in the above formulation (Eq. ... Assume the position-based function at each head is learned to perform 'hard attention' on one of its surrounding positions, i.e., an extreme semi-dynamic attention. To demonstrate this phenomenon, we plot and compare the impacts of Φ_c and Φ_p on Φ_a in the middle and right of Fig. S4 and visualize learned position-based attention Φ_p of iRPE in Fig. S5. As seen from Tab. S17, there exist noticeable performance gaps between the models (b, f, g, h) (without Φ_p) and (a, d, e, i) (with Φ_p). Without adaptive attention (model (c)), Φ_p imposes stronger locality on every layer.
An A-Z list of 2025's biggest stories
Scroll back through the last year, and the same words come up again and again. The top-trending terms of 2025, from artificial intelligence to Zohran Mamdani, shaped headlines across politics, conflict, technology and climate. As the year comes to a close, AJ Labs has compiled an A to Z list of names, places and issues that generated sustained interest throughout 2025, according to a loose analysis of our own most-viewed story tags and those that appeared in Google's most searched. Taken together, these terms are a patchwork of issues that are also likely to spill into 2026, from ongoing conflicts to a changing technosocial landscape not seen since the dawn of the internet. This is 2025 from A to Z, by the words that made the year.
Representing Long-Range Context for Graph Neural Networks with Global Attention
Graph neural networks are powerful architectures for structured datasets. However, current methods struggle to represent long-range dependencies. Scaling the depth or width of GNNs is insufficient to broaden receptive fields as larger GNNs encounter optimization instabilities such as vanishing gradients and representation oversmoothing, while pooling-based approaches have yet to become as universally useful as in computer vision. In this work, we propose the use of Transformer-based self-attention to learn long-range pairwise relationships, with a novel "readout" mechanism to obtain a global graph embedding. Inspired by recent computer vision results that find position-invariant attention performant in learning long-range relationships, our method, which we call GraphTrans, applies a permutation-invariant Transformer module after a standard GNN module. This simple architecture leads to state-of-the-art results on several graph classification tasks, outperforming methods that explicitly encode graph structure. Our results suggest that purely-learning-based approaches without graph structure may be suitable for learning high-level, long-range relationships on graphs.
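The architecture described above (a GNN module followed by a permutation-invariant Transformer module with a special readout) can be sketched roughly as follows. This is a minimal illustration under strong assumptions, not the GraphTrans implementation: the GNN is a single mean-aggregation layer, attention is single-head with identity projections and no positional encoding, and only the readout token's row is computed because only that row is used. The names `gnn_layer`, `readout_attention`, and `cls_emb` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gnn_layer(adj, feats):
    """Mean aggregation over each node and its neighbors (a stand-in GNN layer)."""
    out = []
    for i, nbrs in enumerate(adj):
        group = [feats[i]] + [feats[j] for j in nbrs]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

def readout_attention(node_embs, cls_emb):
    """Attend from a special readout token to all tokens; return its output."""
    tokens = [cls_emb] + node_embs
    scale = math.sqrt(len(cls_emb))
    weights = softmax([dot(cls_emb, t) / scale for t in tokens])
    dim = len(cls_emb)
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]

# Toy graph with edges 0-1, 0-2, 2-3 (assumed data).
adj = [[1, 2], [0], [0, 3], [2]]
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
cls_emb = [0.1, -0.2]  # would be a learned parameter in a real model

graph_emb = readout_attention(gnn_layer(adj, feats), cls_emb)

# Relabel the nodes: new node i is old node perm[i].
perm = [2, 0, 3, 1]
inv = {old: new for new, old in enumerate(perm)}
adj_perm = [[inv[j] for j in adj[old]] for old in perm]
feats_perm = [feats[old] for old in perm]
graph_emb_perm = readout_attention(gnn_layer(adj_perm, feats_perm), cls_emb)
```

Because no positional encoding is added, relabeling the nodes leaves the readout unchanged up to floating-point rounding, so `graph_emb` and `graph_emb_perm` agree. Graph structure enters only through the GNN stage; the attention stage treats the node embeddings as an unordered set.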
RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models
Wang, Bailin; Lan, Chang; Wang, Chong; Pang, Ruoming
Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches. We open-sourced our Pallas kernels along with model codes to facilitate further research effort.
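The idea of pairing a small sliding window with a linear-attention summary of the out-of-window tokens can be sketched as below. This is an illustration under stated assumptions, not the RATTENTION kernel: the feature map (ELU + 1), the plain sum used to combine the two outputs, and the function name `local_plus_linear_attention` are all hypothetical choices, and the open-sourced Pallas kernels are not reproduced here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice for linear attention)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def local_plus_linear_attention(q, k, v, window):
    """Causal sliding-window softmax attention plus a recurrent linear-attention
    summary of every token that has already left the window."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    norm = [0.0] * d_k                         # running sum of phi(k_j)
    outputs = []
    for t in range(len(q)):
        # Token t - window has just left the window: fold it into the state.
        j = t - window
        if j >= 0:
            fk = feature_map(k[j])
            for a in range(d_k):
                norm[a] += fk[a]
                for b in range(d_v):
                    state[a][b] += fk[a] * v[j][b]
        # Softmax attention over the in-window tokens [t - window + 1, t].
        idx = range(max(0, t - window + 1), t + 1)
        w = softmax([dot(q[t], k[i]) / math.sqrt(d_k) for i in idx])
        local = [sum(wi * v[i][b] for wi, i in zip(w, idx)) for b in range(d_v)]
        # Normalized linear attention over the out-of-window tokens.
        fq = feature_map(q[t])
        denom = dot(fq, norm)
        if denom > 0.0:
            far = [sum(fq[a] * state[a][b] for a in range(d_k)) / denom
                   for b in range(d_v)]
        else:
            far = [0.0] * d_v  # nothing has left the window yet
        outputs.append([l + f for l, f in zip(local, far)])
    return outputs

# Four toy tokens with 2-dimensional queries, keys, and values (assumed data).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

out_w1 = local_plus_linear_attention(q, k, v, window=1)
out_full = local_plus_linear_attention(q, k, v, window=4)
```

The recurrent state has a fixed size of `d_k * d_v` numbers regardless of sequence length, so shrinking the window reduces the per-token cost of the softmax part without discarding earlier tokens entirely. With `window=4` covering the whole toy sequence, the linear branch contributes nothing and the result reduces to ordinary causal softmax attention.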